Intercom-style communication using multiple computing devices

ABSTRACT

Techniques are described related to improved intercom-style communication using a plurality of computing devices distributed about an environment. In various implementations, voice input may be received, e.g., at a microphone of a first computing device of multiple computing devices, from a first user. The voice input may be analyzed and, based on the analyzing, it may be determined that the first user intends to convey a message to a second user. A location of the second user relative to the multiple computing devices may be determined, so that, based on the location of the second user, a second computing device may be selected from the multiple computing devices that is capable of providing audio or visual output that is perceptible to the second user. The second computing device may then be operated to provide audio or visual output that conveys the message to the second user.

BACKGROUND

Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “chat bots,” “interactive personal assistants,” “intelligent personal assistants,” “personal voice assistants,” “conversational agents,” etc.). For example, humans (which when they interact with automated assistants may be referred to as “users”) may provide commands, queries, and/or requests using spoken natural language input (i.e., utterances), which may in some cases be converted into text and then processed, and/or by providing textual (e.g., typed) natural language input.

In some cases, automated assistants may include automated assistant “clients” that are installed locally on client devices and that are engaged directly by users, as well as cloud-based counterpart(s) that leverage the virtually limitless resources of the cloud to help automated assistant clients respond to users' queries. For example, the automated assistant client may provide, to the cloud-based counterpart(s), an audio recording of the user's query (or a text conversion thereof) and data indicative of the user's identity (e.g., credentials). The cloud-based counterpart may perform various processing on the query to return various results to the automated assistant client, which may then provide corresponding output to the user. For the sake of brevity and simplicity, the term “automated assistant,” when described herein as “serving” a particular user, may refer to the automated assistant client installed on the particular user's client device and any cloud-based counterpart that interacts with the automated assistant client to respond to the user's queries.

Many users may engage automated assistants using multiple devices. For example, some users may possess a coordinated “ecosystem” of computing devices that includes one or more smart phones, one or more tablet computers, one or more vehicle computing systems, one or more wearable computing devices, one or more smart televisions, and/or one or more standalone interactive speakers, among other more traditional computing devices. A user may engage in human-to-computer dialog with an automated assistant using any of these devices (assuming an automated assistant client is installed). In some cases these devices may be scattered around the user's home or workplace. For example, mobile computing devices such as smart phones, tablets, smart watches, etc., may be on the user's person and/or wherever the user last placed them (e.g., at a charging station). Other computing devices, such as traditional desktop computers, smart televisions, and standalone interactive speakers, may be more stationary but nonetheless may be located at various places (e.g., rooms) within the user's home or workplace.

Techniques exist to enable multiple users (e.g., a family, co-workers, co-inhabitants, etc.) to leverage the distributed nature of a plurality of computing devices to facilitate intercom-style spoken communication between the multiple users. However, these techniques are limited to users issuing explicit commands to convey messages to explicitly-defined computing devices. For example, a first user who wishes to convey a message to a second user at another location out of earshot (e.g., in another room) must first determine where the second user is located. Only then can the first user explicitly invoke an intercom communication channel to a computing device at or near the second user's location, so that the first user can convey a message to the second user at the second user's location. If the first user does not know the second user's location, the first user may be forced to simply cause the message to be broadcast at all computing devices that are available for intercom-style communication. Moreover, if the first user is unaware that the second user is not within earshot (e.g., the first user is cooking and didn't notice the second user leaving the kitchen), the first user may not realize that intercom-style communication is necessary, and may speak the message to an empty room.

SUMMARY

Techniques are described herein for improved intercom-style communication using a plurality of computing devices distributed about an environment such as a house, an apartment, a place of business, etc. For example, techniques are described herein for enabling determination of location(s) of multiple users within the environment, so that (i) it can be determined automatically whether an intended recipient of a spoken message is within earshot of the speaker, and (ii) a suitable computing device near the intended recipient can be identified and used to output the message so that the intended recipient receives it. Additionally, techniques are described herein for automatically determining whether a user utterance constitutes (a) a command to invoke an automated assistant for normal use; (b) an attempt to convey a spoken message to another user that may potentially require the intercom-style communication described herein; and/or (c) other background noise/conversation that requires no action. Additionally, techniques are described herein for allowing a recipient of an intercom-style message received using disclosed techniques to issue a request (e.g., a search query or other commands to an automated assistant such as ordering pizza, playing a song, etc.) that is processed (e.g., using natural language processing) based at least in part on the initial message conveyed by the speaker.

In various implementations, users' locations may be determined within an environment or area by computing devices configured with selected aspects of the present disclosure using various techniques. For example, one or more computing devices may be equipped with various types of presence sensors, such as passive infrared (“PIR”) sensors, cameras, microphones, ultrasonic sensors, and so forth, which can determine whether a user is nearby. These computing devices can come in various forms, such as smart phones, standalone interactive speakers, smart televisions, other smart appliances (e.g., smart thermostats, smart refrigerators, etc.), networked cameras, and so forth. Additionally or alternatively, other types of signals, such as signals emitted by mobile computing devices (e.g., smart phones, smart watches) carried by users, may be detected by other computing devices and used to determine the users' locations (e.g., using time-of-flight, triangulation, etc.). The determination of a user's location within an environment for utilization in various techniques described herein can be contingent on explicit user-provided authorization for such determination. In various implementations, users' locations may be determined “on demand” in response to determining that a user utterance constitutes an attempt to convey a spoken message to another user that may require intercom-style communication. In various other implementations, the users' locations may be determined periodically and/or at other intervals, and the most recently determined locations may be utilized in determining whether an intended recipient of a spoken message is within earshot of a speaker of the spoken message and/or in identifying a suitable computing device near an intended recipient of the spoken message.

As one example, a variety of standalone interactive speakers and/or smart televisions may be distributed at various locations in a home. Each of these devices may include one or more sensors (e.g., microphone, camera, PIR sensor, etc.) capable of detecting a nearby human presence. In some embodiments, these devices may simply detect whether a person is present. In other embodiments, these devices may be able to not only detect presence, but distinguish the detected person, e.g., from other known members of a household. Presence signals generated by these standalone interactive speakers and/or smart televisions may be collected and used to determine/track where people are located at a particular point in time. These detected locations may then be used for various purposes in accordance with techniques described herein, such as determining whether an utterance provided by a speaker is likely to be heard by the intended recipient (e.g., whether the speaker and intended recipients are in different rooms or the same room), and/or to select which of the multiple speakers and/or televisions should be used to output the utterance to the intended recipient.

In another aspect, techniques are described herein for automatically determining whether a user utterance constitutes (a) a command to invoke an automated assistant for normal use; (b) an attempt to convey a spoken message to another user that may potentially require the intercom-style communication described herein; and/or (c) other background noise/conversation that requires no action. In some implementations, a machine learning classifier (e.g., neural network) may be trained using training examples that comprise recorded utterances (and/or features of recorded utterances) that are classified (labeled) as, for instance, a command to convey a message to another user using an intercom-style communication link, a command to engage in a conventional human-to-computer dialog with an automated assistant, or conversation that is not directed to an automated assistant (e.g., background conversation and/or noise).

In some embodiments, speech-to-text (“STT”) may not be performed automatically on every utterance. Instead, the machine learning classifier may be trained to recognize phonemes in the audio recording of the voice input, and in particular to classify the collective phonemes with one of the aforementioned labels. For example, conventional automated assistants are typically invoked using one or more invocation phrases. In some cases, a simple invocation machine learning model (e.g., classifier) is trained to distinguish these invocation phrases from anything else to determine when a user invokes the automated assistant (e.g., to recognize phonemes associated with “Hey, Assistant”). With techniques described herein, the same invocation machine learning model or a different machine learning model may be (further) trained to classify utterances as being intended to convey a message to another user, which may or may not require use of intercom-style communications described herein. In some implementations, such a machine learning model may be used, e.g., in parallel with an invocation machine learning model or after the invocation machine learning model determines that the user is not invoking the automated assistant, to determine whether the user may benefit from using intercom-style communication to cause a remote computing device to convey a message to another user.

In some implementations, a machine learning model may be trained, or “customized,” so that it is possible to recognize names spoken by a user and to associate those names with other individuals. For example, an automated assistant may detect a first utterance such as “Jan, can you pass me the salt?” The automated assistant may detect a second utterance, presumably from Jan, such as “Sure, here you go.” From these utterances and the associated phonemes, the automated assistant may learn that when a user makes a request to Jan, it should locate the individual with Jan's voice. Suppose that later, Jan is talking on the phone in a separate room. When the user says something like “Jan, where are my shoes?”, the automated assistant may determine from this utterance (particularly, “Jan, . . . ”) that the utterance contains a message for the individual, Jan. The automated assistant may also determine that Jan is probably out of earshot, and therefore the message should be conveyed to Jan as an intercom message. By detecting Jan's voice on a nearby client device, the automated assistant may locate Jan and select the nearby client device to output the speaker's message.

In other implementations, a user may invoke an automated assistant using traditional invocation phrases and then explicitly command the automated assistant to cause some other computing device to output a message to be conveyed to a recipient. The other computing device may be automatically selected based on the recipient's detected location as described above, or explicitly designated by the speaking user.

In yet another aspect, techniques are described herein for allowing a recipient of an intercom-style message received using disclosed techniques to leverage context provided in the received intercom message to perform other actions, such as issuing a search query or a command to an automated assistant. For example, after perceiving a conveyed intercom message, the recipient may issue a search query, e.g., at the computing device at which she received the conveyed intercom message or another computing device. Search results may then be obtained, e.g., by an automated assistant serving the recipient, that are responsive to the search query. In some implementations, the search results may be biased or ranked based at least in part on content of the originally conveyed intercom message. Additionally or alternatively, in some implementations, the recipient's search query may be disambiguated based at least in part on content of the originally conveyed intercom message.

In some implementations in which an initial utterance is used to provide context to downstream requests by a recipient user, to protect privacy, the original speaker's utterance may be transcribed (STT) only if it is determined that the recipient makes a downstream request. If the recipient simply listens to the message and does nothing further, no STT may be performed. In other implementations, the original speaker's utterance may always be processed using STT (e.g., on determination that the utterance is to be conveyed through intercom-style communication), but the resulting transcription may be stored only locally and/or for a limited amount of time (e.g., long enough to give the recipient user ample time to make some downstream request).
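By way of non-limiting illustration, the following Python sketch shows one way the deferred, time-limited transcription described above might be organized. The `transcribe` callable and the retention interval are assumptions made for illustration; they are not an API defined by this disclosure.

```python
# Illustrative sketch: transcribe the original utterance only when the
# recipient issues a downstream request, and retain the result only briefly.
# transcribe() is a hypothetical STT callable supplied by the caller.
import time

class EphemeralContext:
    def __init__(self, audio: bytes, ttl_seconds: float = 300.0):
        self._audio = audio
        self._transcript = None
        self._expires_at = time.monotonic() + ttl_seconds

    def transcript_for_downstream_request(self, transcribe) -> str | None:
        """Run STT lazily, and only while the retention window is still open."""
        if time.monotonic() > self._expires_at:
            self._audio = None          # retention window closed; drop the audio
            self._transcript = None
            return None
        if self._transcript is None and self._audio is not None:
            self._transcript = transcribe(self._audio)   # STT performed on demand
        return self._transcript
```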

In some implementations, one or more computing devices may wait until an intended recipient is able to perceive a message (e.g., is within earshot) before conveying the message using techniques described herein. For example, suppose a first user conveys a message to an intended recipient but the intended recipient has stepped outside momentarily. In some implementations, the first computing device to detect the recipient upon their return may output the original message.

In some implementations, a method performed by one or more processors is provided that includes: receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input; analyzing the voice input; determining, based on the analyzing, that the first user intends to convey a message to a second user; determining a location of the second user relative to the plurality of computing devices; selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user; and causing the second computing device to exclusively provide audio or visual output that conveys the message to the second user (e.g., only the second computing device provides the output, to the exclusion of other computing devices).
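By way of non-limiting illustration, the following Python sketch outlines the overall flow of such a method. The helper callables (`classify_utterance`, `locate_user`, `within_earshot`, `play_audio`) are hypothetical stand-ins supplied by the caller, not an API defined by this disclosure.

```python
# Hypothetical sketch of the method described above: classify the utterance,
# locate the intended recipient, and play the message on the single device
# nearest that recipient. Helper callables are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Device:
    device_id: str
    room: str

def relay_message(audio: bytes, speaker: str, devices: list[Device],
                  classify_utterance, locate_user, within_earshot, play_audio) -> None:
    """Convey a captured utterance via the device nearest the intended recipient."""
    intent = classify_utterance(audio)   # e.g. {"label": "convey_message", "recipient": "Jan"}
    if intent["label"] != "convey_message":
        return                           # invoke-assistant or background noise: no intercom action

    recipient_room = locate_user(intent["recipient"])   # last known room from presence signals
    if within_earshot(locate_user(speaker), recipient_room):
        return                           # recipient can already hear the speaker directly

    # Select only the device nearest the recipient and play the message there.
    target = next((d for d in devices if d.room == recipient_room), None)
    if target is not None:
        play_audio(target, audio)
```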

These and other implementations of technology disclosed herein may optionally include one or more of the following features.

In various implementations, the analyzing may include applying an audio recording of the voice input as input across a trained machine learning model to generate output, wherein the output indicates that the first user intends to convey the message to the second user. In various implementations, the machine learning model may be trained using a corpus of labelled utterances, and wherein labels applied to the utterances include a first label indicative of a command to convey a message to another user and a second label indicative of a command to engage in a human-to-computer dialog with an automated assistant. In various implementations, labels applied to the utterances may further include a third label indicative of background conversation.

In various implementations, the selecting may be performed in response to a determination, based on the location of the second user, that the second user is not within earshot of the first user. In various implementations, the location of the second user may be determined based at least in part on one or more signals generated by a mobile computing device operated by the second user. Persons skilled in the art will appreciate from reading the specification that the concepts and subject matter described herein may ensure that messages are conveyed to, and received by, an intended person in a manner which is efficient for the technical equipment used to convey and receive the messages. This may include the messages being conveyed and delivered for perception by the intended person at an appropriate time, so that the messages can be properly understood by the intended person and there is no requirement for messages to be re-conveyed/re-received by the technical equipment for this purpose. The technical equipment may include the multiple computing devices referred to above, as well as a network over which the messages may be conveyed between the devices. The efficiency in the manner in which, and the times at which, messages are conveyed may result in at least more efficient use of the network between the computing devices and also more efficient use of the computational resources, within the computing devices, which are employed to convey and receive the messages.

In various implementations, the location of the second user may be determined based at least in part on one or more signals generated by one or more of the plurality of computing devices other than the first computing device. In various implementations, the one or more signals may include a signal indicative of the second user being detected by one or more of the plurality of computing devices other than the first computing device using passive infrared or ultrasound. In various implementations, the one or more signals may include a signal indicative of the second user being detected by one or more of the plurality of computing devices other than the first computing device using a camera or a microphone.

In various implementations, the analyzing may include determining that the voice input includes an explicit command to convey the message to the second user as an intercom message via one or more of the plurality of computing devices. In various implementations, the analyzing may include performing speech-to-text processing on the voice input to generate textual input, and performing natural language processing on the textual input to determine that the user intends to convey the message to the second user.

In various implementations, the method may further include: identifying a search query issued by the second user after the audio or visual output is provided by the second computing device; obtaining search results that are responsive to the search query, wherein the obtaining is based at least in part on the voice input from the first user; and causing one or more of the plurality of computing devices to provide output indicative of at least some of the search results.
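By way of non-limiting illustration, the following Python sketch shows one simple way search results might be biased by the content of the originally conveyed message. The term-overlap scoring is an assumption made for illustration, not the disclosed ranking method.

```python
# Illustrative re-ranking sketch: boost results whose titles overlap with
# terms from the earlier intercom message. The overlap score and the boost
# weight are assumptions, not the disclosed ranking method.
def rerank(results: list[dict], intercom_message: str, boost: float = 0.5) -> list[dict]:
    context_terms = set(intercom_message.lower().split())

    def score(result: dict) -> float:
        title_terms = set(result["title"].lower().split())
        overlap = len(title_terms & context_terms)
        return result.get("base_score", 0.0) + boost * overlap

    return sorted(results, key=score, reverse=True)

# Example: "strainer" from the original message boosts related results.
results = [
    {"title": "Where to buy a pasta strainer", "base_score": 0.4},
    {"title": "Strainer left in the dishwasher", "base_score": 0.3},
    {"title": "Weather today", "base_score": 0.5},
]
print(rerank(results, "Hey Hon, do you know where the strainer is?"))
```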

In addition, some implementations include one or more processors of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the aforementioned methods. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the aforementioned methods.

It should be appreciated that all combinations of the foregoing concepts and additional concepts described in greater detail herein are contemplated as being part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the subject matter disclosed herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a block diagram of an example environment in which implementations disclosed herein may be implemented.

FIG. 1B schematically depicts one example of how a trained classifier may be applied to generate output based on user utterances and/or locations, in accordance with various implementations.

FIGS. 2, 3, and 4 depict example dialogs between various users and automated assistants, including intercom-style communications, in accordance with various implementations.

FIG. 5 depicts a flowchart illustrating an example method according to implementations disclosed herein.

FIG. 6 illustrates an example architecture of a computing device.

DETAILED DESCRIPTION

Now turning to FIG. 1A, an example environment in which techniques disclosed herein may be implemented is illustrated. The example environment includes a plurality of client computing devices 106_(1-N). Each client device 106 may execute a respective instance of an automated assistant client 118. One or more cloud-based automated assistant components 119, such as a natural language processor 122, may be implemented on one or more computing systems (collectively referred to as a “cloud” computing system) that are communicatively coupled to client devices 106_(1-N) via one or more local and/or wide area networks (e.g., the Internet) indicated generally at 110. Also, in some embodiments, the plurality of client devices 106_(1-N) may be communicatively coupled with each other via one or more local area networks (“LANs,” including Wi-Fi LANs, mesh networks, etc.).

In some implementations, the plurality of client computing devices 106_(1-N) (also referred to herein simply as “client devices”) may be associated with each other in various ways in order to facilitate performance of techniques described herein. For example, in some implementations, the plurality of client computing devices 106_(1-N) may be associated with each other by virtue of being communicatively coupled via one or more LANs. This may be the case, for instance, where the plurality of client computing devices 106_(1-N) are deployed across a particular area or environment, such as a home, a building, a campus, and so forth. Additionally or alternatively, in some implementations, the plurality of client computing devices 106_(1-N) may be associated with each other by virtue of them being members of a coordinated ecosystem of client devices 106 that are operated by one or more users (e.g., an individual, a family, employees of an organization, other predefined groups, etc.).

As noted in the background, an instance of an automated assistant client 118, by way of its interactions with one or more cloud-based automated assistant components 119, may form what appears to be, from the user's perspective, a logical instance of an automated assistant 120 with which the user may engage in a human-to-computer dialog. Two instances of such an automated assistant 120 are depicted in FIG. 1A. A first automated assistant 120A encompassed by a dashed line serves a first user (not depicted) operating first client device 106₁ and includes automated assistant client 118₁ and one or more cloud-based automated assistant components 119. A second automated assistant 120B encompassed by a dash-dash-dot line serves a second user (not depicted) operating another client device 106_(N) and includes automated assistant client 118_(N) and one or more cloud-based automated assistant components 119. It thus should be understood that each user that engages with an automated assistant client 118 executing on a client device 106 may, in effect, engage with his or her own logical instance of an automated assistant 120. For the sake of brevity and simplicity, the term “automated assistant,” as used herein as “serving” a particular user, will refer to the combination of an automated assistant client 118 executing on a client device 106 operated by the user and one or more cloud-based automated assistant components 119 (which may be shared amongst multiple automated assistant clients 118). It should also be understood that in some implementations, automated assistant 120 may respond to a request from any user regardless of whether the user is actually “served” by that particular instance of automated assistant 120.

The client devices 106_(1-N) may include, for example, one or more of: a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a computing device of a vehicle of the user (e.g., an in-vehicle communications system, an in-vehicle entertainment system, an in-vehicle navigation system), a standalone interactive speaker, a smart appliance such as a smart television, and/or a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device). Additional and/or alternative client computing devices may be provided.

In various implementations, one or more of the client computing devices 106_(1-N) may include one or more presence sensors 105_(1-N) that are configured to provide signals indicative of detected presence, particularly human presence. Presence sensors 105_(1-N) may come in various forms. Some client devices 106 may be equipped with one or more digital cameras that are configured to capture and provide signal(s) indicative of movement detected in their fields of view. Additionally or alternatively, some client devices 106 may be equipped with other types of light-based presence sensors 105, such as passive infrared (“PIR”) sensors that measure infrared (“IR”) light radiating from objects within their fields of view. Additionally or alternatively, some client devices 106 may be equipped with presence sensors 105 that detect acoustic (or pressure) waves, such as one or more microphones.

Additionally or alternatively, in some implementations, presence sensors 105 may be configured to detect other phenomena associated with human presence. For example, in some embodiments, a client device 106 may be equipped with a presence sensor 105 that detects various types of waves (e.g., radio, ultrasonic, electromagnetic, etc.) emitted by, for instance, a mobile client device 106 carried/operated by a particular user. For example, some client devices 106 may be configured to emit waves that are imperceptible to humans, such as ultrasonic waves or infrared waves, that may be detected by other client devices 106 (e.g., via ultrasonic/infrared receivers such as ultrasonic-capable microphones).

Additionally or alternatively, various client devices 106 may emit other types of human-imperceptible waves, such as radio waves (e.g., Wi-Fi, Bluetooth, cellular, etc.) that may be detected by one or more other client devices 106 and used to determine an operating user's particular location. In some implementations, Wi-Fi triangulation may be used to detect a person's location, e.g., based on Wi-Fi signals to/from a client device 106. In other implementations, other wireless signal characteristics, such as time-of-flight, signal strength, etc., may be used by various client devices 106, alone or collectively, to determine a particular person's location based on signals emitted by a client device 106 they carry.
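By way of non-limiting illustration, the following Python sketch estimates a person's position from the signal strengths that several fixed client devices observe from a carried mobile device. The log-distance path-loss constants and the weighted-centroid combination are assumptions made for illustration, not the disclosed localization method.

```python
# Illustrative sketch: estimate a person's position from the Wi-Fi/Bluetooth
# signal strength (RSSI) that several fixed client devices observe from the
# phone the person carries. The path-loss constants are assumptions.
import math

def rssi_to_distance(rssi_dbm: float, tx_power_dbm: float = -40.0, path_loss_exp: float = 2.5) -> float:
    """Convert RSSI to an approximate distance in meters via a log-distance model."""
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * path_loss_exp))

def estimate_position(observations: dict[str, float],
                      device_positions: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Weighted centroid of observing devices; closer devices (stronger RSSI) weigh more."""
    total_w, x, y = 0.0, 0.0, 0.0
    for device_id, rssi in observations.items():
        px, py = device_positions[device_id]
        w = 1.0 / max(rssi_to_distance(rssi), 0.1)
        total_w += w
        x += w * px
        y += w * py
    return (x / total_w, y / total_w)

# Example: three fixed speakers observing the phone of the intended recipient.
positions = {"kitchen": (0.0, 0.0), "den": (6.0, 0.0), "living_room": (3.0, 5.0)}
readings = {"kitchen": -70.0, "den": -62.0, "living_room": -55.0}
print(estimate_position(readings, positions))   # lands nearest the living room speaker
```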

Additionally or alternatively, in some implementations, one or more client devices 106 may perform voice recognition to recognize an individual from their voice. For example, some automated assistants 120 may be configured to match a voice to a user's profile, e.g., for purposes of providing/restricting access to various resources. In some implementations, movement of the speaker may then be tracked, e.g., by one or more other presence sensors that may be incorporated, for instance, in lights, light switches, smart thermostats, security cameras, etc. In some implementations, based on such detected movement, a location of the individual may be predicted, and this location may be assumed to be the individual's location when another individual (i.e., a speaker) provides an utterance with a message for the first individual. In some implementations, an individual may simply be assumed to be in the last location at which he or she engaged with automated assistant 120, especially if not much time has passed since the last engagement.

Each of the client computing devices 106_(1-N) may operate a variety of different applications, such as a corresponding one of a plurality of message exchange clients 107_(1-N). Message exchange clients 107_(1-N) may come in various forms, and the forms may vary across the client computing devices 106_(1-N) and/or multiple forms may be operated on a single one of the client computing devices 106_(1-N). In some implementations, one or more of the message exchange clients 107_(1-N) may come in the form of a short messaging service (“SMS”) and/or multimedia messaging service (“MMS”) client, an online chat client (e.g., instant messenger, Internet relay chat, or “IRC,” etc.), a messaging application associated with a social network, a personal assistant messaging service dedicated to conversations with automated assistant 120, and so forth. In some implementations, one or more of the message exchange clients 107_(1-N) may be implemented via a webpage or other resources rendered by a web browser (not depicted) or other application of client computing device 106.

As described in more detail herein, automated assistant 120 engages in human-to-computer dialog sessions with one or more users via user interface input and output devices of one or more client devices 106_(1-N). In some implementations, automated assistant 120 may engage in a human-to-computer dialog session with a user in response to user interface input provided by the user via one or more user interface input devices of one of the client devices 106_(1-N). In some of those implementations, the user interface input is explicitly directed to automated assistant 120. For example, one of the message exchange clients 107_(1-N) may be a personal assistant messaging service dedicated to conversations with automated assistant 120, and user interface input provided via that personal assistant messaging service may be automatically provided to automated assistant 120. Also, for example, the user interface input may be explicitly directed to automated assistant 120 in one or more of the message exchange clients 107_(1-N) based on particular user interface input that indicates automated assistant 120 is to be invoked. For instance, the particular user interface input may be one or more typed characters (e.g., @AutomatedAssistant), user interaction with a hardware button and/or virtual button (e.g., a tap, a long tap), an oral command (e.g., “Hey Automated Assistant”), and/or other particular user interface input.

In some implementations, automated assistant 120 may engage in a dialog session in response to user interface input, even when that user interface input is not explicitly directed to automated assistant 120. For example, automated assistant 120 may examine the contents of user interface input and engage in a dialog session in response to certain terms being present in the user interface input and/or based on other cues. In many implementations, automated assistant 120 may engage interactive voice response (“IVR”), such that the user can utter commands, searches, etc., and the automated assistant may utilize natural language processing and/or one or more grammars to convert the utterances into text, and respond to the text accordingly. In some implementations, the automated assistant 120 can additionally or alternatively respond to utterances without converting the utterances into text. For example, the automated assistant 120 can convert voice input into an embedding, into entity representation(s) (that indicate entity/entities present in the voice input), and/or other “non-textual” representation, and operate on such non-textual representation. Accordingly, implementations described herein as operating based on text converted from voice input may additionally and/or alternatively operate on the voice input directly and/or other non-textual representations of the voice input.

Each of the client computing devices 106_(1-N) and computing device(s) operating cloud-based automated assistant components 119 may include one or more memories for storage of data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network. The operations performed by one or more of the client computing devices 106_(1-N) and/or by automated assistant 120 may be distributed across multiple computer systems. Automated assistant 120 may be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network.

As noted above, in various implementations, each of the client computing devices 106_(1-N) may operate an automated assistant client 118. In various embodiments, each automated assistant client 118 may include a corresponding speech capture/text-to-speech (“TTS”)/STT module 114. In other implementations, one or more aspects of speech capture/TTS/STT module 114 may be implemented separately from automated assistant client 118.

Each speech capture/TTS/STT module 114 may be configured to perform one or more functions: capture a user's speech, e.g., via a microphone (which in some cases may comprise presence sensor 105); convert that captured audio to text (and/or to other representations or embeddings); and/or convert text to speech. For example, in some implementations, because a client device 106 may be relatively constrained in terms of computing resources (e.g., processor cycles, memory, battery, etc.), the speech capture/TTS/STT module 114 that is local to each client device 106 may be configured to convert a finite number of different spoken phrases—particularly phrases that invoke automated assistant 120 and/or intercom-style communication—to text (or to other forms, such as lower dimensionality embeddings). Other speech input may be sent to cloud-based automated assistant components 119, which may include a cloud-based TTS module 116 and/or a cloud-based STT module 117.
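By way of non-limiting illustration, the following Python sketch shows the kind of local/cloud split described above: a small on-device matcher handles a finite phrase set, and any other speech is deferred to a cloud STT service. The phrase set and the `recognize_local`/`send_to_cloud_stt` callables are hypothetical stand-ins for illustration only.

```python
# Illustrative sketch of the local/cloud split described above: the on-device
# recognizer only knows a small, fixed phrase set (invocation and intercom
# cues); any other speech is forwarded to cloud STT.
LOCAL_PHRASES = {
    "hey assistant": "invoke_assistant",
    "ok assistant": "invoke_assistant",
    "hey jan": "convey_message",
}

def handle_speech(audio: bytes, recognize_local, send_to_cloud_stt) -> str:
    """Handle a finite set of phrases on-device; defer everything else to the cloud."""
    phrase = recognize_local(audio)              # returns a phrase from LOCAL_PHRASES, or None
    if phrase in LOCAL_PHRASES:
        return LOCAL_PHRASES[phrase]             # resolved entirely on the client device
    return send_to_cloud_stt(audio)              # open-ended speech: full cloud transcription
```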

In some implementations, components that contribute to implementation of intercom-style communication as described herein may intentionally be operated exclusively on one or more client devices 106 that are associated with each other, for instance, by virtue of being on the same LAN. In some such implementations, any machine learning models described elsewhere herein may be trained and/or stored on one or more client devices 106, e.g., behind an Internet firewall, so that training data and other information generated by or associated with the machine learning models may be maintained in privacy. And in some such implementations, the cloud-based STT module 117, cloud-based TTS module 116, and/or cloud-based aspects of natural language processor 122 may not be involved in invocation of intercom-style communications.

Cloud-based STT module 117 may be configured to leverage the virtually limitless resources of the cloud to convert audio data captured by speech capture/TTS/STT module 114 into text (which may then be provided to natural language processor 122). Cloud-based TTS module 116 may be configured to leverage the virtually limitless resources of the cloud to convert textual data (e.g., natural language responses formulated by automated assistant 120) into computer-generated speech output. In some implementations, TTS module 116 may provide the computer-generated speech output to client device 106 to be output directly, e.g., using one or more speakers. In other implementations, textual data (e.g., natural language responses) generated by automated assistant 120 may be provided to speech capture/TTS/STT module 114, which may then convert the textual data into computer-generated speech that is output locally.

Automated assistant 120 (and in particular, cloud-based automated assistant components 119) may include a natural language processor 122, the aforementioned TTS module 116, the aforementioned STT module 117, and other components, some of which are described in more detail below. In some implementations, one or more of the engines and/or modules of automated assistant 120 may be omitted, combined, and/or implemented in a component that is separate from automated assistant 120. And as noted above, in some implementations, to protect privacy, one or more of the components of automated assistant 120, such as natural language processor 122, speech capture/TTS/STT module 114, etc., may be implemented at least in part on client devices 106 (e.g., to the exclusion of the cloud). In some such implementations, speech capture/TTS/STT module 114 may be sufficiently configured to perform selected aspects of the present disclosure to enable intercom-style communication, while in some cases leaving other, non-intercom-related natural language processing aspects to cloud-based components when suitable.

In some implementations, automated assistant 120 generates responsive content in response to various inputs generated by a user of one of the client devices 106_(1-N) during a human-to-computer dialog session with automated assistant 120. Automated assistant 120 may provide the responsive content (e.g., over one or more networks when separate from a client device of a user) for presentation to the user as part of the dialog session. For example, automated assistant 120 may generate responsive content in response to free-form natural language input provided via one of the client devices 106_(1-N). As used herein, free-form input is input that is formulated by a user and that is not constrained to a group of options presented for selection by the user.

As used herein, a “dialog session” may include a logically-self-contained exchange of one or more messages between a user and automated assistant 120 (and in some cases, other human participants). Automated assistant 120 may differentiate between multiple dialog sessions with a user based on various signals, such as passage of time between sessions, change of user context (e.g., location, before/during/after a scheduled meeting, etc.) between sessions, detection of one or more intervening interactions between the user and a client device other than dialog between the user and the automated assistant (e.g., the user switches applications for a while, the user walks away from then later returns to a standalone voice-activated product), locking/sleeping of the client device between sessions, change of client devices used to interface with one or more instances of automated assistant 120, and so forth.

Natural language processor 122 of automated assistant 120 processes natural language input generated by users via client devices 106_(1-N) and may generate annotated output for use by one or more other components of automated assistant 120. For example, the natural language processor 122 may process natural language free-form input that is generated by a user via one or more user interface input devices of client device 106₁. The generated annotated output includes one or more annotations of the natural language input and optionally one or more (e.g., all) of the terms of the natural language input.

In some implementations, the natural language processor 122 is configured to identify and annotate various types of grammatical information in natural language input. For example, the natural language processor 122 may include a part of speech tagger configured to annotate terms with their grammatical roles. For example, the part of speech tagger may tag each term with its part of speech such as “noun,” “verb,” “adjective,” “pronoun,” etc. Also, for example, in some implementations the natural language processor 122 may additionally and/or alternatively include a dependency parser (not depicted) configured to determine syntactic relationships between terms in natural language input. For example, the dependency parser may determine which terms modify other terms, subjects and verbs of sentences, and so forth (e.g., a parse tree)—and may make annotations of such dependencies.

In some implementations, the natural language processor 122 may additionally and/or alternatively include an entity tagger (not depicted) configured to annotate entity references in one or more segments, such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth. In some implementations, data about entities may be stored in one or more databases, such as in a knowledge graph (not depicted). In some implementations, the knowledge graph may include nodes that represent known entities (and in some cases, entity attributes), as well as edges that connect the nodes and represent relationships between the entities. For example, a “banana” node may be connected (e.g., as a child) to a “fruit” node, which in turn may be connected (e.g., as a child) to “produce” and/or “food” nodes. As another example, a restaurant called “Hypothetical Café” may be represented by a node that also includes attributes such as its address, type of food served, hours, contact information, etc. The “Hypothetical Café” node may in some implementations be connected by an edge (e.g., representing a child-to-parent relationship) to one or more other nodes, such as a “restaurant” node, a “business” node, a node representing a city and/or state in which the restaurant is located, and so forth.
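By way of non-limiting illustration, the following Python sketch represents the node/edge structure described above with plain dictionaries and tuples, reusing the “banana”/“fruit” and “Hypothetical Café” examples; the attribute values shown are placeholders, not real data.

```python
# Illustrative sketch of the knowledge-graph structure described above:
# nodes carry attributes, edges carry relationship types.
nodes = {
    "banana": {},
    "fruit": {},
    "produce": {},
    "Hypothetical Café": {"address": "123 Example St", "cuisine": "coffee", "hours": "7am-7pm"},
    "restaurant": {},
}
edges = [
    ("banana", "is_a", "fruit"),
    ("fruit", "is_a", "produce"),
    ("Hypothetical Café", "is_a", "restaurant"),
]

def parents(entity: str) -> list[str]:
    """Follow child-to-parent edges from an entity."""
    return [dst for src, rel, dst in edges if src == entity and rel == "is_a"]

print(parents("banana"))             # ['fruit']
print(nodes["Hypothetical Café"])    # attribute lookup used during entity resolution
```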

The entity tagger of the natural language processor 122 may annotate references to an entity at a high level of granularity (e.g., to enable identification of all references to an entity class such as people) and/or a lower level of granularity (e.g., to enable identification of all references to a particular entity such as a particular person). The entity tagger may rely on content of the natural language input to resolve a particular entity and/or may optionally communicate with a knowledge graph or other entity database to resolve a particular entity.

In some implementations, the natural language processor 122 may additionally and/or alternatively include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues. For example, the coreference resolver may be utilized to resolve the term “there” to “Hypothetical Café” in the natural language input “I liked Hypothetical Café last time we ate there.”

In some implementations, one or more components of the natural language processor 122 may rely on annotations from one or more other components of the natural language processor 122. For example, in some implementations the named entity tagger may rely on annotations from the coreference resolver and/or dependency parser in annotating all mentions of a particular entity. Also, for example, in some implementations the coreference resolver may rely on annotations from the dependency parser in clustering references to the same entity. In some implementations, in processing a particular natural language input, one or more components of the natural language processor 122 may use related prior input and/or other related data outside of the particular natural language input to determine one or more annotations.

In various implementations, cloud-based automated assistant components 119 may include an intercom communication analysis service (“ICAS”) 138 and/or an intercom communication location service (“ICLS”) 140. In other implementations, services 138 and/or 140 may be implemented separately from cloud-based automated assistant components 119, e.g., on one or more client devices 106 and/or on another computer system (e.g., in the so-called “cloud”).

In various implementations, ICAS 138 may be configured to determine, based on a variety of signals and/or data points, how and/or when to facilitate intercom-style communication between multiple users using multiple client devices 106. For example, in various implementations, ICAS 138 may be configured to analyze voice input provided by a first user at a microphone of a client device 106 of a plurality of associated client devices 106_(1-N). In various implementations, ICAS 138 may analyze the first user's voice input and determine, based on the analysis, that the voice input contains a message intended for a second user.

Various techniques may be employed as part of the analysis to determine whether the first user intended to convey a message to the second user. In some implementations, an audio recording of the first user's voice input may be applied as input across a trained machine learning classifier to generate output. The output may indicate that the first user's voice input contained a message intended for the second user. Various types of machine learning classifiers (or more generally, “models”) may be trained to provide such output, including but not limited to various types of neural networks (e.g., feed-forward, convolutional, etc.).

In some implementations, labeled phonemes of users' utterances may be used to train a machine learning model such as a neural network to learn embeddings of utterances into lower dimensionality representations. These embeddings, which may include lower dimensionality representations of the original phonemes, may then be used (e.g., as input for the trained model) to identify when a user intends to use the intercom-style communication described herein, and/or when a user's utterance contains a message intended for another person. For example, labeled utterances may be embedded into a reduced dimensionality space, e.g., such that they are clustered into groups associated with intercom-style communication and non-intercom-style communication. A new, unlabeled utterance may then be embedded, and may be classified based on which cluster its embedding is nearest (e.g., in Euclidean space).
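By way of non-limiting illustration, the following Python sketch (using NumPy) classifies a new utterance embedding by the nearest per-label centroid in Euclidean space; how the embeddings themselves are produced from phonemes is assumed rather than shown, and the toy vectors are placeholders.

```python
# Illustrative nearest-centroid sketch: labelled utterance embeddings are
# averaged into per-class centroids, and a new utterance embedding is
# classified by the nearest centroid in Euclidean distance.
import numpy as np

def build_centroids(embeddings: np.ndarray, labels: list[str]) -> dict[str, np.ndarray]:
    return {lab: embeddings[[i for i, l in enumerate(labels) if l == lab]].mean(axis=0)
            for lab in set(labels)}

def classify(embedding: np.ndarray, centroids: dict[str, np.ndarray]) -> str:
    return min(centroids, key=lambda lab: np.linalg.norm(embedding - centroids[lab]))

# Toy 2-D embeddings: one cluster for intercom-style messages, one for other speech.
train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels = ["convey_message", "convey_message", "background", "background"]
centroids = build_centroids(train, labels)
print(classify(np.array([0.85, 0.15]), centroids))   # -> "convey_message"
```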

In some implementations, a neural network (or other classifier) may be trained using training data in the form of a corpus of labelled user utterances (in which case the training is “supervised”). Labels applied to the corpus of utterances may include, for instance, a first label indicative of an utterance that contains a message intended for another user, a second label indicative of a command to engage in a human-to-computer dialog with automated assistant 120, and/or a third label indicative of background noise (which may be ignored). The labelled training examples may be applied as input to an untrained neural network. Differences between the output of the untrained (or not fully-trained) neural network and the labels—a.k.a. error—may be determined and used with techniques such as back propagation, stochastic gradient descent, objective function optimization, etc., to adjust various weights of one or more hidden layers of the neural network to reduce the error.
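By way of non-limiting illustration, the following sketch (using PyTorch) shows supervised training of a small feed-forward classifier over the three labels discussed above with cross-entropy loss, backpropagation, and stochastic gradient descent; the feature dimensionality and the random placeholder data are assumptions made for illustration.

```python
# Illustrative supervised-training sketch: a small feed-forward network maps
# fixed-length utterance features to the three labels discussed above. Real
# labelled audio features would replace the random placeholder tensors.
import torch
import torch.nn as nn

LABELS = ["convey_message", "invoke_assistant", "background_noise"]

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, len(LABELS)))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholder data: 128 examples of 40-dimensional acoustic features.
features = torch.randn(128, 40)
targets = torch.randint(0, len(LABELS), (128,))

for epoch in range(20):
    optimizer.zero_grad()
    logits = model(features)
    loss = loss_fn(logits, targets)     # error between predictions and labels
    loss.backward()                     # backpropagation
    optimizer.step()                    # adjust weights to reduce the error
```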

As noted in the background, machine learning classifiers such as neural networks may already be trained to recognize (e.g., classify) phonemes or other audio characteristics of utterances that are intended to invoke automated assistant 120. In some implementations, the same classifier may be further trained to both recognize (e.g., classify) explicit invocation of automated assistant 120, and to determine whether an utterance contains a message intended for a second user. In other implementations, separate machine learning classifiers may be used for each of these two tasks, e.g., one after the other or in parallel.

In addition to determining that a captured (recorded) utterance contains a message intended for another user, it may also be determined whether intercom-style communication is warranted, e.g., based on respective locations of the speaker and the intended recipient. In various implementations, ICLS 140 may determine a location of the intended recipient relative to the plurality of client devices 106_(1-N), e.g., using presence sensor(s) 105 associated with one or more of the client devices 106_(1-N). For example, ICLS 140 may determine which client device 106 is nearest the intended recipient, and/or which room the intended recipient is in (which in some cases may be associated with a client device deployed in that room). Based on the location of the intended recipient determined by ICLS 140, in various implementations, ICAS 138 may select, from the plurality of client devices 106_(1-N), a second client device 106 that is capable of providing audio or visual output that is perceptible to the intended recipient. For example, if the intended recipient was last detected walking into a particular area, then a client device 106 nearest that area may be selected.

In some implementations, ICLS 140 may be provided, e.g., as part of cloud-based automated assistant components 119 and/or separately therefrom. In other implementations, ICAS 138 and ICLS 140 may be implemented together in a single model or engine. In various implementations, ICLS 140 may be configured to track locations of persons within an area of interest, such as within a home, a workplace, a campus, etc., based on signals provided by, for example, presence sensors 105 integral with a plurality of client devices 106_(1-N) that are distributed throughout the area. Based on these tracked locations, ICLS 140 and/or ICAS 138 may be configured to facilitate intercom-style communication between persons in the area using the plurality of client devices 106_(1-N) as described herein.

In some implementations, ICLS 140 may create and/or maintain a list or database of persons located in a particular area, and/or their last known locations relative to a plurality of client devices 106_(1-N) deployed in the area. In some implementations, this list/database may be updated, e.g., in real time, as persons are detected by different client devices as having moved to different locations. For example, ICLS 140 may drop a particular person from the list/database if, for example, that person is not detected in the overall area for some predetermined time interval (e.g., one hour) and/or if the person is last detected passing through an ingress or egress area (e.g., front door, back door, etc.). In other implementations, ICLS 140 may update the list/database periodically, e.g., every few minutes, hours, etc.
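By way of non-limiting illustration, the following Python sketch maintains a last-known-location table of the kind described above, dropping entries that grow stale or that correspond to a person last detected at an ingress/egress area; the room names and the staleness interval are assumptions for illustration.

```python
# Illustrative sketch of the last-known-location registry described above:
# presence detections update an in-memory table, and entries are dropped
# after a configurable staleness interval or when a person exits the area.
import time

class LocationRegistry:
    def __init__(self, stale_after_s: float = 3600.0):
        self._last_seen: dict[str, tuple[str, float]] = {}   # person -> (room, timestamp)
        self._stale_after_s = stale_after_s

    def report_detection(self, person: str, room: str) -> None:
        if room in ("front_door", "back_door"):
            self._last_seen.pop(person, None)                # person likely left the area
        else:
            self._last_seen[person] = (room, time.monotonic())

    def last_known_room(self, person: str) -> str | None:
        entry = self._last_seen.get(person)
        if entry is None:
            return None
        room, seen_at = entry
        if time.monotonic() - seen_at > self._stale_after_s:
            del self._last_seen[person]                      # too old to be trusted
            return None
        return room
```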

In some implementations, ICAS 138 and/or ICLS 140 (and more generally, automated assistant 120) may be configured to distinguish between different people using signals from presence sensors 105, rather than simply detecting the presence of a generic person. For example, suppose a client device 106 includes a microphone as a presence sensor 105. Automated assistant 120 may be configured to use a variety of speaker recognition and/or voice recognition techniques to determine not only that someone is present nearby, but who is present. These speaker recognition and/or voice recognition techniques may include but are not limited to hidden Markov models, Gaussian mixture models, frequency estimation, trained classifiers, deep learning, pattern matching algorithms, matrix representation, vector quantization, decision trees, etc.

If a person near a microphone-equipped client device 106 does not happen to be speaking, then other techniques may be employed to identify the person. Suppose a client device 106 includes, as a presence sensor 105, a camera and/or a PIR sensor. In some implementations, a machine learning visual recognition classifier may be trained using labelled training data captured by such a presence sensor 105 to recognize the person visually. In some implementations, a user may cause the visual recognition classifier to be trained by invoking a training routine at one or more camera/PIR sensor-equipped client devices 106. For example, a user may stand in a field of view of presence sensor 105 and invoke automated assistant 120 with a phrase such as “Hey Assistant, I am Jan and this is what I look like.” In some implementations, automated assistant 120 may provide audible or visual output that prompts the user to move around to various positions within a field of view of presence sensor 105, while presence sensor 105 captures one or more snapshots of the user. These snapshots may then be labelled (e.g., with “Jan”) and used as labelled training examples for supervised training of the visual recognition classifier. In other implementations, labelled training examples for visual recognition may be generated automatically, e.g., without the user being aware. For example, when the user is in a field of view of presence sensor 105, a signal (e.g., radio wave, ultrasonic) emitted by a mobile client device 106 carried by the user may be analyzed, e.g., by automated assistant 120, to determine the user's identity (and hence, a label) for snapshots captured by presence sensor 105.

And in yet other implementations, other types of cues besides audio and/or visual cues may be employed to distinguish users from one another. For example, radio, ultrasonic, and/or other types of wireless signals (e.g., infrared, modulated light, etc.) emitted by client devices 106 carried by users may be analyzed, e.g., by automated assistant 120, to discern an identity of a nearby user. In some implementations, a user's mobile client device 106 may include a network identifier, such as “Jan's Smartphone,” that may be used to identify the user.

Referring now to FIG. 1B, an example data flow is depicted schematically to demonstrate one possible way in which a trained machine learning classifier may be applied to analyze user utterances and determine, among other things, whether to employ intercom-style communication. In FIG. 1B, a phoneme classifier 142 (which may be a component of automated assistant 120) may be trained such that one or more utterances and one or more person locations may be applied across phoneme classifier 142 as input. Phoneme classifier 142 may then generate, as output, a classification of the utterance(s). In FIG. 1B, these classifications include “invoke assistant,” “convey message,” and “background noise,” but additional and/or alternative labels are possible.

Conventional phoneme classifiers already exist that detect explicit invocation phrases such as “Hey, Assistant,” “OK Assistant,” etc. In some implementations, phoneme classifier 142 may include the same functionality, such that when an input utterance includes such an invocation phrase, the output of phoneme classifier 142 is “invoke assistant.” Once automated assistant 120 is invoked, the user may engage in human-to-computer dialog with automated assistant 120 as is known in the art.

However, in some implementations, phoneme classifier 142 may be further trained to recognize other phonemes that signal a user intent to convey a message to another user. For example, users may often use phrases such as “Hey, <name>” to get another person's attention. More generally, phoneme classifier 142 may operate to match custom phrases, words, etc. Additionally or alternatively, to get another person's attention, it may be common to first speak the other person's name, sometimes in a slightly elevated volume and/or with particular intonations, or to use other types of intonations. In various implementations, phoneme classifier 142 may be trained to recognize such phonemes and generate output such as “convey message” to signal a scenario in which intercom-style communication may potentially be warranted. In various implementations, a separate intonation model may optionally be separately trained to recognize utterances that seek communication with another person (e.g., to differentiate such utterances from casual utterances) and generate output that indicates the presence of such utterances (e.g., a likelihood that such an utterance is present). The outputs from the phoneme classifier and the intonation model, for a given user utterance, may be collectively considered in determining if intercom-style communication may be warranted.

In some implementations, one or more person locations may be provided, e.g., by ICLS 140, as input to phoneme classifier 142. These person locations may be used, in addition to or instead of the utterance(s), to determine whether intercom-style communication is warranted. For example, if the intended recipient's location is sufficiently near (e.g., within earshot of) the speaker's location, that may influence phoneme classifier 142 to produce output such as "background noise," even if the utterance contains a message intended for another person. On the other hand, suppose the intended recipient's location is out of earshot of the speaker's location. That may influence phoneme classifier 142 to produce output such as "convey message," which may increase a likelihood that intercom-style communication is employed. Additionally or alternatively, a two-step approach may be implemented in which it is first determined whether a speaker's utterance contains a message intended for another user, and it is then determined whether the other user is out of earshot of the speaker. If the answer to both questions is yes, then intercom-style communication may be implemented to convey the message to the intended recipient.
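
A minimal sketch of the two-step approach follows, assuming that message detection has already produced a Boolean and that "earshot" is approximated by same-room or adjacent-room relationships configured elsewhere.

    # Step 1 (message_intended) is assumed to come from the classifier above;
    # step 2 checks whether the intended recipient is out of earshot.
    from typing import Set, Tuple

    ADJACENT_ROOMS: Set[Tuple[str, str]] = {("kitchen", "dining room")}  # illustrative

    def within_earshot(speaker_room: str, recipient_room: str) -> bool:
        if speaker_room == recipient_room:
            return True
        return (speaker_room, recipient_room) in ADJACENT_ROOMS or \
               (recipient_room, speaker_room) in ADJACENT_ROOMS

    def should_relay(message_intended: bool, speaker_room: str, recipient_room: str) -> bool:
        return message_intended and not within_earshot(speaker_room, recipient_room)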

Referring now to FIG. 2, a home floorplan is depicted that includes a plurality of rooms 250-262. A plurality of client devices 206₁₋₄ are deployed throughout at least some of the rooms. Each client device 206 may implement an instance of automated assistant client 118 configured with selected aspects of the present disclosure and may include one or more input devices, such as microphones, that are capable of capturing utterances spoken by a person nearby. For example, a first client device 206₁, taking the form of a standalone interactive speaker, is deployed in room 250, which in this example is a kitchen. A second client device 206₂, taking the form of a so-called "smart" television (e.g., a networked television with one or more processors that implement an instance of automated assistant client 118), is deployed in room 252, which in this example is a den. A third client device 206₃, taking the form of an interactive standalone speaker, is deployed in room 254, which in this example is a bedroom. A fourth client device 206₄, taking the form of another interactive standalone speaker, is deployed in room 256, which in this example is a living room.

While not depicted in FIG. 2, the plurality of client devices 206₁₋₄ may be communicatively coupled with each other and/or other resources (e.g., the Internet) via one or more wired or wireless LANs (e.g., 110₂ in FIG. 1A). Additionally, other client devices—particularly mobile devices such as smart phones, tablets, laptops, wearable devices, etc.—may also be present, e.g., carried by one or more persons in the home, and may or may not also be connected to the same LAN. It should be understood that the configuration of client devices depicted in FIG. 2 and elsewhere in the Figures is just one example; more or fewer client devices may be deployed across any number of rooms and/or areas other than a home.

In the example of FIG. 2, a first user, Jack, is in the kitchen 250 when he utters the question, "Hey Hon, do you know where the strainer is?" Perhaps unbeknownst to Jack, his wife, Jan, is not in kitchen 250, but rather is in living room 256, and therefore likely did not hear Jack's question. First client device 206₁, which as noted above is configured with selected aspects of the present disclosure, may detect Jack's utterance. A recording of the utterance may be analyzed using techniques described above to determine that Jack's utterance contains a message intended for Jan. First client device 206₁ also may determine, e.g., based on information shared amongst all of the plurality of client devices 206₁₋₄, that Jan is in living room 256 (or at least nearest fourth client device 206₄). For example, client device 206₄ may have detected, e.g., using one or more integral presence sensors (e.g., 105 in FIG. 1A), that Jan is in living room 256.

Based on Jan's detected location and/or on attribute(s) of Jack's utterance (which in some implementations may be classified using a trained machine learning model as described above), first client device 206₁ may determine that Jack intended his message for Jan and that Jan is out of earshot of Jack. Consequently, first client device 206₁ may push (over one or more of the aforementioned LANs) a recording of Jack's utterance (or in some cases, transcribed text of Jack's utterance) to the client device nearest Jan, which in this example is fourth client device 206₄. On receiving this data, fourth client device 206₄ may, e.g., by way of automated assistant 120 executing at least in part on fourth client device 206₄, audibly output Jack's message to Jan as depicted in FIG. 2, thus effecting intercom-style communication between Jack and Jan.
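
The push described above might look, in outline, like the following; the device registry, the "/play" endpoint, and the HTTP transport are assumptions made for the sake of a concrete example and are not required by the disclosure.

    # Sketch: forward a recorded utterance to the client device nearest the
    # intended recipient over the LAN.
    import urllib.request

    DEVICE_BY_ROOM = {                       # illustrative registry
        "kitchen": "http://192.168.1.21:8080",
        "living room": "http://192.168.1.24:8080",
    }

    def push_utterance(recipient_room: str, audio_wav: bytes, sender: str) -> None:
        req = urllib.request.Request(
            DEVICE_BY_ROOM[recipient_room] + "/play",
            data=audio_wav,
            headers={"Content-Type": "audio/wav", "X-Sender": sender},
            method="POST",
        )
        urllib.request.urlopen(req)          # error handling omitted for brevity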

In the example of FIG. 2 (and in similar examples described elsewhere herein), Jack's question is output to Jan audibly using fourth client device 206₄, which as noted above is a standalone interactive speaker. However, this is not meant to be limiting. In various implementations, Jack's message may be conveyed to Jan using other output modalities. For example, in some implementations in which a mobile client device (not depicted) carried by Jan is connected to the Wi-Fi LAN, that mobile device may output Jack's message, either as an audible recording or as a textual message that is conveyed to Jan visually, e.g., using an application such as message exchange client 107 executing on Jan's mobile client device.

In various implementations, recordings and/or STT transcriptions of utterances that are exchanged between client devices 106 to facilitate intercom communication may be used for a variety of additional purposes. In some embodiments, they may be used to provide context to downstream human-to-computer dialogs between user(s) and automated assistant 120. For example, in some scenarios, a recorded utterance and/or its STT transcription may be used to disambiguate a request provided to an instance of automated assistant 120, whether that request be from the user who originally provided the utterance, an intended recipient of the utterance, or even another user who engages automated assistant 120 subsequent to an intercom-style communication involving a plurality of client devices 106.

FIG. 3 depicts the same home and distribution of client devices 206₁₋₄ as was depicted in FIG. 2. In FIG. 3, Jan (still in living room 256) speaks the utterance, "Hey Jack, you should leave soon to pick up Bob from the airport." It may be determined, e.g., by ICLS 140, that Jack is in another room, out of earshot from Jan. For example, ICLS 140 may determine, e.g., based on a signal provided by an onboard camera and/or PIR sensor of a "smart" thermostat 264, that Jack is located in den 252. Based on that determination, and/or a determination that Jan's utterance has been classified (e.g., using one of the aforementioned machine learning models) as a message intended for Jack, a client device near Jack's detected location, such as client device 206₂, may be identified to output Jan's utterance. In some implementations, Jan's recorded utterance may be pushed from another computing device near Jan that recorded it, such as client device 206₄, to client device 206₂ identified near Jack, and output audibly (or visually, since client device 206₂ is a smart television with display capabilities).

FIG. 4 demonstrates an example follow-up scenario to that depicted in FIG. 3. After receiving Jan's conveyed message via client device 206₂, Jack says "OK Assistant—when is the next tram leaving?" Without additional information, this request, or search query, may be too ambiguous to answer, and automated assistant 120 may be required to solicit disambiguating information from Jack. However, using techniques described herein, automated assistant 120 may disambiguate Jack's request based on Jan's original utterance to determine that the tram to the airport is the one Jack is interested in. Additionally or alternatively, automated assistant 120 could simply retrieve results for all nearby trams, and then rank those results based on Jan's utterance, e.g., so that the tram to the airport is ranked highest. Whichever the case, in FIG. 4, automated assistant 120 provides audio output at client device 206₂ of "Next tram to the airport leaves in 10 minutes."
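
One simple, purely illustrative way to realize the ranking alternative is to score candidate results by overlap with terms from Jan's earlier utterance, as sketched below; the scoring scheme is an assumption, not a method mandated by the disclosure.

    # Rank otherwise-ambiguous results using terms from the earlier conveyed utterance.
    from typing import List

    def rank_results(results: List[str], context_utterance: str) -> List[str]:
        context_terms = set(context_utterance.lower().split())
        def score(result: str) -> int:
            return sum(term in result.lower() for term in context_terms)
        return sorted(results, key=score, reverse=True)

    # rank_results(["Tram 3 to downtown", "Tram 7 to the airport"],
    #              "you should leave soon to pick up Bob from the airport")
    # -> ["Tram 7 to the airport", "Tram 3 to downtown"]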

FIG. 5 is a flowchart illustrating an example method 500 according to implementations disclosed herein. For convenience, the operations of the flowchart are described with reference to a system that performs the operations. This system may include various components of various computer systems, such as one or more components of computing systems that implement automated assistant 120. Moreover, while operations of method 500 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, or added.

At block 502, the system may receive, at an input device of a first computing device of a plurality of computing devices, from a first user, free form natural language input. In many implementations, this free form natural language input may come in the form of voice input, i.e., an utterance from the first user, though this is not required. It should be understood that this voice input need not necessarily be directed by the first user at automated assistant 120, and instead may include any utterance provided by the first user that is captured and/or recorded by a client device configured with selected aspects of the present disclosure.

At block 504, the system may analyze the voice input. Various aspects (e.g., phonemes) of the voice input may be analyzed, including but not limited to intonation, volume, recognized phrases, etc.

In some implementations, the system may analyze other signals in addition to the voice input. These other signals may include, for instance, a number of people in an environment such as a house. For instance, if only one person is present, intercom capabilities may not be utilized. If only two people are present, then the location of the other person may automatically be determined to be the location at which intercom output should be provided. If more than two people are present, then the system may employ various techniques (e.g., voice recognition, facial recognition, wireless signal recognition, etc.) to attempt to distinguish people from each other.
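
A compact sketch of that person-count heuristic, assuming occupant names and locations are supplied by the presence sensing described elsewhere herein:

    from typing import Dict, Optional

    def choose_target_location(occupants: Dict[str, str], speaker: str) -> Optional[str]:
        others = {name: loc for name, loc in occupants.items() if name != speaker}
        if not others:
            return None                         # only the speaker present: no intercom
        if len(others) == 1:
            return next(iter(others.values()))  # one other person: use their location
        return None  # several candidates: fall back to voice/face/signal recognition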

At block 506, the system may determine, based on the analyzing, that the first user intends to convey a message to a second user (e.g., that the voice input contains a message intended for the second user). As described previously, automated assistant 120 may employ various techniques, such as a classifier trained on labeled training examples in the form of recorded utterances, to analyze the voice input and/or determine whether the first user's voice input is a command to invoke automated assistant 120 (e.g., "Hey, Assistant") to engage in further human-to-computer dialog, an utterance intended to convey a message to the second user (or multiple other users), or other background noise. In some implementations, in addition to or instead of a trained machine learning model, a rules-based approach may be implemented. For example, one or more simple IVR grammars may be defined, e.g., using technologies such as voice extensible markup language, or "VXML," that are designed to match utterances that are intended to convey messages between users.

At block 508 (which may occur after blocks 502-506 or on an ongoing basis), the system may determine a location of the second user relative to the plurality of computing devices. In some embodiments, ICLS 140 may maintain a list or database of people in an area such as a home or workplace and their last-known (i.e., last-detected) locations. In some such implementations, the system may simply consult this list or database for the second user's location. In other implementations, the system may actively poll a plurality of client devices in the environment to seek out the second user, e.g., on an as-needed basis (e.g., when it is determined that the first user's utterance contains a message intended to be conveyed to the second user). This may cause the client devices to activate presence sensors (105 in FIG. 1A) so that they can detect whether someone (e.g., the second user) is nearby.
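
Block 508 could be realized roughly as follows; the last-known-location store and the poll_presence() helper (standing in for on-demand activation of presence sensors 105) are assumptions for illustration.

    from typing import Dict, Iterable, Optional

    LAST_KNOWN: Dict[str, str] = {}              # e.g., {"Jan": "living room"}

    def poll_presence(room: str, user: str) -> bool:
        # Hypothetical: ask the client device in `room` to activate its presence
        # sensor and report whether `user` is detected nearby.
        return False

    def locate_user(user: str, device_rooms: Iterable[str]) -> Optional[str]:
        if user in LAST_KNOWN:                   # consult the last-known-location store
            return LAST_KNOWN[user]
        for room in device_rooms:                # otherwise poll devices as needed
            if poll_presence(room, user):
                LAST_KNOWN[user] = room
                return room
        return None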

At block 510, the system may select, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user. In some implementations, the second computing device may be a stationary client device (e.g., a standalone interactive speaker, smart television, desktop computer, etc.) that is deployed in a particular area of an environment. In other implementations, the second computing device may be a mobile client device carried by the second user. In some such implementations, the mobile client device may become part of the plurality of computing devices considered by the system by virtue of being part of the same coordinated ecosystem and/or joining the same wireless LAN (or simply being located within a predetermined distance).
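
Block 510 might then reduce to something like the sketch below, in which a stationary device in the recipient's room is preferred and a recipient-carried mobile device on the same LAN serves as a fallback; both conventions are assumptions made for illustration.

    from typing import Dict, Optional

    def select_output_device(recipient_room: Optional[str],
                             stationary_by_room: Dict[str, str],
                             recipient_mobile: Optional[str]) -> Optional[str]:
        if recipient_room and recipient_room in stationary_by_room:
            return stationary_by_room[recipient_room]   # e.g., the living-room speaker
        return recipient_mobile                          # may be None if unreachable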

At block 512, the system may cause the second computing device identified at block 510 to provide audio or visual output that conveys the message to the second user. For example, in some implementations in which ICAS 138 and/or ICLS 140 are cloud-based, one or the other may cause a recording of the first user's utterance to be forwarded (e.g., streamed) to the second computing device selected at block 510. The second computing device, which may be configured with selected aspects of the present disclosure, may respond by outputting the forwarded recording.

While examples described herein have included a first user attempting to convey a message to a single other user, this is not meant to be limiting. In various implementations, a user's utterance may be intended for multiple other users, such as multiple members of the speaker's family, all persons in the area or environment, etc. In some such implementations, one or more of the aforementioned machine learning classifiers may be (further) trained to determine whether an utterance contains a message intended for a single recipient or multiple recipients. If multiple recipients are intended, then the system may convey the message to the multiple intended recipients at multiple locations in various ways. In some simple implementations, the system may simply cause the message to be pushed to all client devices in the area (e.g., all client devices of a coordinated ecosystem and/or all client devices connected to a Wi-Fi LAN), effectively broadcasting the message. In other implementations, the system (e.g., ICLS 140) may determine locations of all intended recipients on an individual basis, and output the message on only those client devices that are near each intended recipient.

In some implementations, automated assistant 120 may wait until an intended recipient of a message is able to perceive the message (e.g., is within earshot) before causing the message to be conveyed using techniques described herein. For example, suppose a first user conveys a message to an intended recipient but the intended recipient has stepped outside momentarily. In some implementations, the message may be temporarily delayed until the intended recipient is detected by one or more computing devices. The first computing device to detect the recipient upon their return may output the original message. In some implementations, a variety of signals, such as the intended recipient's position coordinates (e.g., Global Positioning System, or "GPS," coordinates) obtained from a mobile device they carry, may be used to determine that the intended recipient will not be reachable using intercom-style communication (at least not with any devices on the LAN). In some implementations, the message (e.g., a recording of the speaker's utterance) may be forwarded to the recipient's mobile device. In other implementations, automated assistant 120 may determine that the intended recipient is unreachable, and may provide output, e.g., at a client device closest to the speaker (e.g., the device that captured the speaker's utterance), that notifies the speaker that the recipient is unreachable at the moment. In some such implementations, automated assistant 120 may prompt the user for permission to forward the message to the recipient's mobile device, e.g., by outputting something like "I can't reach Jan directly right now. Would you like me to send a message to their phone?"
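
The deferral-and-fallback behavior described above could be organized along the following lines; the pending-message queue and the prompt wording shown are illustrative assumptions.

    from collections import deque
    from typing import Callable, Deque, Tuple

    PENDING: Deque[Tuple[str, bytes]] = deque()      # (recipient, recorded utterance)

    def on_recipient_unreachable(recipient: str, audio: bytes) -> str:
        PENDING.append((recipient, audio))
        return (f"I can't reach {recipient} directly right now. "
                f"Would you like me to send a message to their phone?")

    def on_user_detected(user: str, deliver: Callable[[bytes], None]) -> None:
        # Called when any client device detects `user`; flush any queued messages.
        for recipient, audio in list(PENDING):
            if recipient == user:
                deliver(audio)
                PENDING.remove((recipient, audio))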

Optional blocks 514-518 may or may not occur if the second user issues free form natural language input sometime after the first user's message is output at block 512. At block 514, the system may identify a free form natural language input, such as a voice input, that is issued by the second user after the audio or visual output is provided by the second computing device at block 512. The second user's voice input may include, for instance, a command and/or a search query. In some embodiments, the command and/or search query may be, by itself, too ambiguous to properly interpret, as was the case with Jack's utterance in FIG. 4.

At block 516, the system may analyze the second user's free form natural language input identified at block 514 based at least in part on the free form natural language input received from the first user at block 502. In other words, the first user's utterance may be transcribed and used to provide context to the second user's subsequent request. At block 518, the system may formulate a response to the second user's natural language input based on the context provided by the first user's original free form natural language input. For example, if the second user's free form natural language input included a search query (such as Jack's query in FIG. 4), the system may obtain search results that are responsive to the search query based at least in part on the voice input from the first user. For example, the second user's search query may be disambiguated based on the first user's original utterance, and/or one or more responsive search results may be ranked based on the first user's original utterance. The system may then cause one or more of the plurality of computing devices to provide output indicative of at least some of the search results, as occurred in FIG. 4.

In various implementations, users may preconfigure (e.g., commission) client computing devices in their home, workplace, or another environment to be usable to engage in intercom-style communications as described herein. For example, in some implementations, a user may, e.g., using a graphical user interface and/or by engaging in a human-to-computer dialog session with automated assistant 120, assign a "location" to each stationary client computing device, such as "kitchen," "dining room," etc. Consequently, in some such implementations, a user may explicitly invoke automated assistant 120 to facilitate intercom-style communication to a particular location. For example, a user may provide the following voice input to convey a message to another user: "Hey Assistant, tell Oliver in the kitchen that we need more butter."
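
To make the commissioning and explicit-invocation flow concrete, the following sketch registers device locations and parses a request of the form used in the example above; the registry layout and the regular expression are illustrative assumptions.

    import re
    from typing import Dict, Optional, Tuple

    DEVICE_LOCATIONS: Dict[str, str] = {}            # device_id -> assigned location

    def commission(device_id: str, location: str) -> None:
        DEVICE_LOCATIONS[device_id] = location       # e.g., commission("speaker-1", "kitchen")

    PATTERN = re.compile(r"tell (?P<name>\w+) in the (?P<loc>[\w ]+?) that (?P<msg>.+)", re.I)

    def parse_explicit_intercom(utterance: str) -> Optional[Tuple[str, str, str]]:
        m = PATTERN.search(utterance)
        return (m["name"], m["loc"], m["msg"]) if m else None

    # parse_explicit_intercom("Hey Assistant, tell Oliver in the kitchen that we need more butter.")
    # -> ("Oliver", "kitchen", "we need more butter.")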

More generally, in some implementations, users may explicitly designate a recipient of a message when they invoke intercom-style communication. If the user does not also specify a location of the recipient, then techniques described herein, e.g., in association with ICLS 140, may be used to automatically determine a location of the recipient and select which computing device will be used to output the message to the recipient. However, as described above, a user need not explicitly invoke intercom-style communications. Rather, various signals and/or data points (e.g., output of a machine learning classifier, location of an intended recipient, etc.) may be considered to determine, without explicit instruction from the user, that the user's message should be conveyed automatically using intercom-style communication.

FIG. 6 is a block diagram of an example computing device 610 that may optionally be utilized to perform one or more aspects of techniques described herein. In some implementations, one or more of a client computing device, user-controlled resources engine 130, and/or other component(s) may comprise one or more components of the example computing device 610.

Computing device 610 typically includes at least one processor 614 which communicates with a number of peripheral devices via bus subsystem 612. These peripheral devices may include a storage subsystem 624, including, for example, a memory subsystem 625 and a file storage subsystem 626, user interface output devices 620, user interface input devices 622, and a network interface subsystem 616. The input and output devices allow user interaction with computing device 610. Network interface subsystem 616 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.

User interface input devices 622 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into computing device 610 or onto a communication network.

User interface output devices 620 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from computing device 610 to the user or to another machine or computing device.

Storage subsystem 624 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 624 may include the logic to perform selected aspects of the method of FIG. 5, as well as to implement various components depicted in FIG. 1A.

These software modules are generally executed by processor 614 alone or in combination with other processors. Memory 625 used in the storage subsystem 624 can include a number of memories including a main random access memory (RAM) 630 for storage of instructions and data during program execution and a read only memory (ROM) 632 in which fixed instructions are stored. A file storage subsystem 626 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 626 in the storage subsystem 624, or in other machines accessible by the processor(s) 614.

Bus subsystem 612 provides a mechanism for letting the various components and subsystems of computing device 610 communicate with each other as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.

Computing device 610 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 610 depicted in FIG. 6 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 610 are possible having more or fewer components than the computing device depicted in FIG. 6.

In situations in which certain implementations discussed herein may collect or use personal information about users (e.g., user data extracted from other electronic communications, information about a user's social network, a user's location, a user's time, a user's biometric information, a user's activities and demographic information, relationships between users, etc.), users are provided with one or more opportunities to control whether information is collected, whether the personal information is stored, whether the personal information is used, and how the information about the user is collected, stored, and used. That is, the systems and methods discussed herein collect, store, and/or use user personal information only upon receiving explicit authorization from the relevant users to do so.

For example, a user is provided with control over whether programs or features collect user information about that particular user or other users relevant to the program or feature. Each user for which personal information is to be collected is presented with one or more options to allow control over the information collection relevant to that user, to provide permission or authorization as to whether the information is collected and as to which portions of the information are to be collected. For example, users can be provided with one or more such control options over a communication network. In addition, certain data may be treated in one or more ways before it is stored or used so that personally identifiable information is removed. As one example, a user's identity may be treated so that no personally identifiable information can be determined. As another example, a user's geographic location may be generalized to a larger region so that the user's particular location cannot be determined.

While several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary, and the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.

What is claimed is:
1. A method comprising: accessing a trained machine learning model, wherein the machine learning model is trained, using a corpus of labeled voice inputs, to predict whether voice inputs are indicative of background conversation that should be ignored, or are indicative of a user intent to convey a message to one or more other users; receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input; analyzing the voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate output, wherein the output indicates that the first user intends to convey a message to the one or more other users; determining, based on the analyzing, that the first user intends to convey the message to the one or more other users; and causing one or more other computing devices of the plurality of computing devices to provide audio or visual output that conveys the message to the one or more other users.
2. The method of claim 1, further comprising: receiving, at the microphone of the first computing device, an additional voice input; analyzing the additional voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate additional output, wherein the additional output indicates that the additional voice input is indicative of background noise that should be ignored; and ignoring the additional voice input in response to the additional output indicating that the additional voice input is indicative of background noise that should be ignored.
3. The method of claim 1, wherein the machine learning model is trained using a corpus of labeled voice inputs, and wherein labels applied to the voice inputs include: a first label indicative of a user intent to convey a message to one or more other users; and a second label indicative of background conversation between multiple users.

4. The method of claim 3, wherein the labels applied to the voice inputs further include a third label indicative of a user intent to engage in a human-to-computer dialog with an automated assistant.
5. The method of claim 1, further comprising: determining a location of a second user of the one or more users relative to the plurality of computing devices; and selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user; wherein the causing includes causing the second computing device to provide the audio or visual output that conveys the message to the second user.
6. The method of claim 1, wherein the causing comprises broadcasting the message to the one or more other users using all of the plurality of computing devices.
7. The method of claim 1, wherein the analyzing includes performing speech-to-text processing on an audio recording of the voice input to generate, as the data indicative of the audio recording, textual input, wherein the textual input is applied as input across the trained machine learning model.
8. A system comprising one or more processors and memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations: accessing a trained machine learning model, wherein the machine learning model is trained, using a corpus of labeled voice inputs, to predict whether voice inputs are indicative of background conversation that should be ignored, or are indicative of a user intent to convey a message to one or more other users; receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input; analyzing the voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate output, wherein the output indicates that the first user intends to convey a message to the one or more other users; determining, based on the analyzing, that the first user intends to convey the message to the one or more other users; and causing one or more other computing devices of the plurality of computing devices to provide audio or visual output that conveys the message to the one or more other users.
9. The system of claim 8, wherein the operations further comprise: receiving, at the microphone of the first computing device, an additional voice input; analyzing the additional voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate additional output, wherein the additional output indicates that the additional voice input is indicative of background noise that should be ignored; and ignoring the additional voice input in response to the additional output indicating that the additional voice input is indicative of background noise that should be ignored.
10. The system of claim 8, wherein the machine learning model is trained using a corpus of labeled voice inputs, and wherein labels applied to the voice inputs include: a first label indicative of a user intent to convey a message to one or more other users; and a second label indicative of background conversation between multiple users.
11. The system of claim 10, wherein the labels applied to the voice inputs further include a third label indicative of a user intent to engage in a human-to-computer dialog with an automated assistant.
12. The system of claim 8, wherein the operations further comprise: determining a location of a second user of the one or more users relative to the plurality of computing devices; and selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user; wherein the causing includes causing the second computing device to provide the audio or visual output that conveys the message to the second user.
13. The system of claim 8, wherein the causing comprises broadcasting the message to the one or more other users using all of the plurality of computing devices.

14. The system of claim 8, wherein the analyzing includes performing speech-to-text processing on an audio recording of the voice input to generate, as the data indicative of the audio recording, textual input, wherein the textual input is applied as input across the trained machine learning model.
15. At least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations: accessing a trained machine learning model, wherein the machine learning model is trained, using a corpus of labeled voice inputs, to predict whether voice inputs are indicative of background conversation that should be ignored, or are indicative of a user intent to convey a message to one or more other users; receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input; analyzing the voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate output, wherein the output indicates that the first user intends to convey a message to the one or more other users; determining, based on the analyzing, that the first user intends to convey the message to the one or more other users; and causing one or more other computing devices of the plurality of computing devices to provide audio or visual output that conveys the message to the one or more other users.
16. The at least one non-transitory computer-readable medium of claim 15, further comprising instructions for: receiving, at the microphone of the first computing device, an additional voice input; analyzing the additional voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate additional output, wherein the additional output indicates that the additional voice input is indicative of background noise that should be ignored; and ignoring the additional voice input in response to the additional output indicating that the additional voice input is indicative of background noise that should be ignored.
17. The at least one non-transitory computer-readable medium of claim 15, wherein the machine learning model is trained using a corpus of labeled voice inputs, and wherein labels applied to the voice inputs include: a first label indicative of a user intent to convey a message to one or more other users; and a second label indicative of background conversation between multiple users.
18. The at least one non-transitory computer-readable medium of claim 17, wherein the labels applied to the voice inputs further include a third label indicative of a user intent to engage in a human-to-computer dialog with an automated assistant.

19. The at least one non-transitory computer-readable medium of claim 15, further comprising instructions for: determining a location of a second user of the one or more users relative to the plurality of computing devices; and selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user; wherein the causing includes causing the second computing device to provide the audio or visual output that conveys the message to the second user.
20. The at least one non-transitory computer-readable medium of claim 15, wherein the causing comprises broadcasting the message to the one or more other users using all of the plurality of computing devices.